task difficulty
TaskSense: Cognitive Chain Modeling and Difficulty Estimation for GUI Tasks
Yin, Yiwen, Hu, Zhian, Xu, Xiaoxi, Yu, Chun, Wu, Xintong, Fan, Wenyu, Shi, Yuanchun
Measuring GUI task difficulty is crucial for user behavior analysis and agent capability evaluation. Yet, existing benchmarks typically quantify difficulty based on motor actions (e.g., step counts), overlooking the cognitive demands underlying task completion. In this work, we propose Cognitive Chain, a novel framework that models task difficulty from a cognitive perspective. A cognitive chain decomposes the cognitive processes preceding a motor action into a sequence of cognitive steps (e.g., finding, deciding, computing), each with a difficulty index grounded in information theory. We develop an LLM-based method to automatically extract cognitive chains from task execution traces. Validation with linear regression shows that our estimated cognitive difficulty correlates well with user completion time (step-level R-squared = 0.46 after annotation). Assessment of state-of-the-art GUI agents shows reduced success on cognitively demanding tasks, revealing capability gaps and human-AI consistency patterns. We conclude by discussing potential applications in agent training, capability assessment, and human-agent delegation optimization.
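The abstract does not give the exact formula behind the per-step difficulty index, only that it is grounded in information theory. A minimal sketch of what such an index could look like, using a Hick's-law-style log measure of the choice a step resolves (the per-step-type weights and the step taxonomy fields here are illustrative assumptions, not the paper's published formula):

```python
import math

def step_difficulty(kind: str, n_options: int) -> float:
    """Information-theoretic difficulty index for one cognitive step.

    Illustrative only: log2(n_options + 1) is the bits needed to
    resolve a choice among n_options (Hick's law); the per-kind
    weights below are hypothetical.
    """
    base = math.log2(n_options + 1)
    weights = {"finding": 1.0, "deciding": 1.2, "computing": 1.5}
    return weights.get(kind, 1.0) * base

def chain_difficulty(steps) -> float:
    """Total difficulty of a cognitive chain = sum over its steps."""
    return sum(step_difficulty(kind, n) for kind, n in steps)
```

Under this sketch, a chain that finds one target among 7 candidates and then decides among 3 options scores `log2(8) + 1.2 * log2(4) = 5.4`, and step-level scores like these would be the regressors against user completion time.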
Think How to Think: Mitigating Overthinking with Autonomous Difficulty Cognition in Large Reasoning Models
Liu, Yongjiang, Li, Haoxi, Ma, Xiaosong, Zhang, Jie, Guo, Song
Recent Large Reasoning Models (LRMs) excel at complex reasoning tasks but often suffer from overthinking, generating overly long and redundant reasoning trajectories. To explore the essence of this behavior, our empirical analysis reveals that LRMs are primarily limited in recognizing task properties (i.e., difficulty levels) the way humans do before solving a problem, leading to a one-size-fits-all reasoning process. Inspired by this, a pressing and natural question emerges: can we explicitly bootstrap such ability to alleviate overthinking in LRMs? In this paper, we propose Think-How-to-Think (TH2T), a novel two-stage fine-tuning strategy that progressively instills difficulty cognition and redundancy cognition in LRMs. Specifically, we first inject difficulty hypnosis into output prefixes to guide the model toward adaptive reasoning depth, trained on a hybrid dataset mixing short and long reasoning paths. Then, we incorporate redundancy hypnosis, which supervises the intermediate reasoning steps to identify and eliminate unnecessary reasoning patterns. Experiments on 7B/14B/32B models demonstrate that TH2T significantly reduces inference costs by over 70% on easy tasks and 40% on hard tasks while maintaining performance stability. The resulting outputs exhibit clear signs of difficulty-aware capabilities and reduced redundancy (e.g., reflection and looping).
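The "difficulty hypnosis" stage amounts to rewriting training targets so they open with an explicit difficulty judgment. A minimal sketch of such prefix injection, assuming a sample schema (`question`, `reasoning`, `is_easy`) and prefix wording that are illustrative, not TH2T's actual dataset format:

```python
def inject_difficulty_hypnosis(sample: dict) -> dict:
    """Prepend a difficulty-cognition prefix to a training target.

    Hypothetical sketch: the fields and the prefix text are
    assumptions; TH2T's concrete schema and wording may differ.
    """
    prefix = (
        "<think> This problem looks easy; a short solution suffices. "
        if sample["is_easy"]
        else "<think> This problem is hard; reason step by step carefully. "
    )
    return {
        "question": sample["question"],
        "target": prefix + sample["reasoning"],
    }
```

Fine-tuning on a hybrid of such short-path (easy) and long-path (hard) targets is what would teach the model to state, and then act on, its own difficulty estimate.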
Think Right: Learning to Mitigate Under-Over Thinking via Adaptive, Attentive Compression
Singh, Joykirat, Chen, Justin Chih-Yao, Prasad, Archiki, Stengel-Eskin, Elias, Nambi, Akshay, Bansal, Mohit
Recent thinking models are capable of solving complex reasoning tasks by scaling test-time compute across various domains, but this scaling must be allocated in line with task difficulty. On one hand, short reasoning (underthinking) leads to errors on harder problems that require extended reasoning steps; on the other, excessively long reasoning (overthinking) can be token-inefficient, generating unnecessary steps even after reaching a correct intermediate solution. We refer to this as under-adaptivity, where the model fails to modulate its response length appropriately given problems of varying difficulty. To address under-adaptivity and strike a balance between under- and overthinking, we propose TRAAC (Think Right with Adaptive, Attentive Compression), an online post-training RL method that leverages the model's self-attention over a long reasoning trajectory to identify important steps and prune redundant ones. TRAAC also estimates difficulty and incorporates it into training rewards, thereby learning to allocate reasoning budget commensurate with example difficulty. Our approach improves accuracy, reduces reasoning steps, and enables adaptive thinking compared to base models and other RL baselines. Across a variety of tasks (AIME, AMC, GPQA-D, BBEH), TRAAC (Qwen3-4B) achieves an average absolute accuracy gain of 8.4% with a relative reduction in reasoning length of 36.8% compared to the base model, and a 7.9% accuracy gain paired with a 29.4% length drop compared to the best RL baseline. TRAAC also shows strong generalization: although our models are trained on math datasets, they show accuracy and efficiency gains on out-of-distribution non-math datasets like GPQA-D, BBEH, and OptimalThinkingBench. Our analysis further verifies that TRAAC provides fine-grained adjustments to thinking budget based on difficulty and that a combination of task-difficulty calibration and attention-based compression yields gains across diverse tasks.
Recent advancements in thinking models have enabled language models to solve complex reasoning tasks (DeepSeek-AI et al., 2025; OpenAI et al., 2024; Team, 2025). These models extend the chain-of-thought (CoT; Wei et al., 2023) paradigm with online reinforcement learning (RL; Shao et al., 2024), allowing them to refine intermediate solutions as well as sequentially scaling the number of tokens (i.e., compute) to arrive at the final answer. While such approaches show strong promise for harder problems in domains like mathematics, programming, and logical puzzles (Xie et al., 2025; Chen et al., 2025), their accuracy and utility remain capped by a failure to regulate their reasoning length.
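The core of difficulty-calibrated training rewards can be sketched independently of the attention-based pruning: tie the tolerated token budget to estimated difficulty and penalize overshoot. The budget interpolation and penalty weight below are illustrative assumptions, not TRAAC's published reward:

```python
def difficulty_calibrated_reward(correct: bool, n_tokens: int,
                                 difficulty: float,
                                 budget_easy: int = 256,
                                 budget_hard: int = 1024) -> float:
    """Length reward in the spirit of TRAAC's difficulty calibration.

    `difficulty` in [0, 1] interpolates a token budget between the
    easy and hard extremes; tokens beyond the budget are penalized
    proportionally. Constants here are hypothetical.
    """
    budget = budget_easy + difficulty * (budget_hard - budget_easy)
    length_penalty = max(0.0, (n_tokens - budget) / budget)
    return (1.0 if correct else 0.0) - 0.5 * length_penalty
```

A correct 256-token answer to an easy (difficulty 0) problem earns the full reward of 1.0, while doubling its length to 512 tokens halves the reward, which is the pressure that teaches length modulation during online RL.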
SubCDM: Collective Decision-Making with a Swarm Subset
Fuady, Samratul, Tarapore, Danesh, Soorati, Mohammad D.
Collective decision-making is a key function of autonomous robot swarms, enabling them to reach a consensus on actions based on environmental features. Existing strategies require the participation of all robots in the decision-making process, which is resource-intensive and prevents the swarm from allocating the robots to any other tasks. We propose Subset-Based Collective Decision-Making (SubCDM), which enables decisions using only a swarm subset. The construction of the subset is dynamic and decentralized, relying solely on local information. Our method allows the swarm to adaptively determine the size of the subset for accurate decision-making, depending on the difficulty of reaching a consensus. Simulation results using one hundred robots show that our approach achieves accuracy comparable to using the entire swarm while reducing the number of robots required to perform collective decision-making, making it a resource-efficient solution for collective decision-making in swarm robotics.
Swarm robotics is a rapidly growing area of research, gaining significant attention due to its broad potential applications across various fields [1].
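The adaptive-subset idea can be illustrated with a deliberately centralized toy version (SubCDM itself is decentralized and uses only local information, so this is a sketch of the size-adaptation behavior, not the algorithm): sample a small subset of robot opinions and enlarge it only while no option reaches an agreement threshold.

```python
import random

def subset_decision(swarm_opinions, start=10, grow=5,
                    threshold=0.8, seed=0):
    """Grow a decision subset until one option passes `threshold`.

    Toy centralized sketch: easy (lopsided) decisions resolve with a
    small subset; hard (near-tie) decisions escalate toward the full
    swarm. All parameter values are illustrative.
    """
    rng = random.Random(seed)
    pool = list(swarm_opinions)
    rng.shuffle(pool)  # stand-in for random subset construction
    size = start
    while size <= len(pool):
        subset = pool[:size]
        top = max(set(subset), key=subset.count)
        if subset.count(top) / size >= threshold:
            return top, size  # consensus reached with `size` robots
        size += grow
    # near-tie: fall back to a full-swarm majority vote
    return max(set(pool), key=pool.count), len(pool)
```

With a unanimous swarm of 100 robots, the toy version decides with the initial 10-robot subset; the closer the opinions are to a tie, the larger the subset it ends up consulting, mirroring the difficulty-dependent subset sizing described above.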
C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning
Chen, Xiuwei, Hu, Wentao, Li, Hanhui, Zhou, Jun, Chen, Zisheng, Cao, Meng, Zeng, Yihan, Zhang, Kui, Yuan, Yu-Jie, Han, Jianhua, Xu, Hang, Liang, Xiaodan
Recent advances in multimodal large language models (MLLMs) have shown impressive reasoning capabilities. However, further enhancing existing MLLMs necessitates high-quality vision-language datasets with carefully curated task complexities, which are both costly and challenging to scale. Although recent self-improving models that iteratively refine themselves offer a feasible solution, they still suffer from two core challenges: (i) most existing methods augment visual or textual data separately, resulting in discrepancies in data complexity (e.g., over-simplified diagrams paired with redundant textual descriptions); and (ii) the evolution of data and models is also separated, leading to scenarios where models are exposed to tasks with mismatched difficulty levels. To address these issues, we propose C2-Evo, an automatic, closed-loop self-improving framework that jointly evolves both training data and model capabilities. Specifically, given a base dataset and a base model, C2-Evo enhances them by a cross-modal data evolution loop and a data-model evolution loop. The former loop expands the base dataset by generating complex multimodal problems that combine structured textual sub-problems with iteratively specified geometric diagrams, while the latter loop adaptively selects the generated problems based on the performance of the base model, to conduct supervised fine-tuning and reinforcement learning alternately. Consequently, our method continuously refines its model and training data, and consistently obtains considerable performance gains across multiple mathematical reasoning benchmarks. Our code, models, and datasets will be released.
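The data-model evolution loop's selection step (keeping generated problems whose difficulty matches the current model) can be sketched as a solve-rate band filter. The band limits and the `solve_rate` estimator (e.g., pass@k over sampled answers) are assumptions; C2-Evo's concrete criterion may differ:

```python
def select_evolution_batch(problems, solve_rate, lo=0.2, hi=0.8):
    """Keep generated problems in a target difficulty band.

    `solve_rate` is an assumed callable (problem -> empirical success
    rate of the current model). Problems the model always or never
    solves are dropped: neither provides useful SFT/RL signal.
    """
    return [p for p in problems if lo <= solve_rate(p) <= hi]
```

Each iteration would then alternate supervised fine-tuning and RL on the filtered batch, re-estimate solve rates with the updated model, and generate the next round of harder multimodal problems.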
Do Language Models Mirror Human Confidence? Exploring Psychological Insights to Address Overconfidence in LLMs
Xu, Chenjun, Wen, Bingbing, Han, Bin, Wolfe, Robert, Wang, Lucy Lu, Howe, Bill
Psychology research has shown that humans are poor at estimating their performance on tasks, tending towards underconfidence on easy tasks and overconfidence on difficult tasks. We examine three LLMs, Llama-3-70B-instruct, Claude-3-Sonnet, and GPT-4o, on a range of QA tasks of varying difficulty, and show that models exhibit subtle differences from human patterns of overconfidence: they are less sensitive to task difficulty, and when prompted to answer based on different personas -- e.g., expert vs. layman, or different races, genders, and ages -- they respond with stereotypically biased confidence estimates even though their underlying answer accuracy remains the same. Based on these observations, we propose Answer-Free Confidence Estimation (AFCE) to improve confidence calibration and LLM interpretability in these settings. AFCE is a self-assessment method that employs two stages of prompting, first eliciting only confidence scores on questions, then asking separately for the answer. Experiments on the MMLU and GPQA datasets spanning subjects and difficulty show that this separation of tasks significantly reduces overconfidence and delivers more human-like sensitivity to task difficulty.
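The two-stage structure of AFCE is concrete enough to sketch: one prompt elicits only a confidence score, a separate prompt elicits the answer. The `ask_llm` callable and the prompt wording are assumptions for illustration; the paper's exact prompts may differ.

```python
def afce(ask_llm, question: str):
    """Answer-Free Confidence Estimation as two separate prompts.

    `ask_llm` is an assumed callable (prompt string -> reply string).
    Stage 1 asks for confidence only, without answering; stage 2
    separately asks for the answer.
    """
    conf_raw = ask_llm(
        "Without answering, rate your confidence (0-100) that you "
        "can answer this question correctly. Reply with a number "
        f"only.\n{question}"
    )
    confidence = float(conf_raw.strip()) / 100.0
    answer = ask_llm(f"Answer the question concisely.\n{question}")
    return answer, confidence
```

Because the model commits to a confidence before producing any answer tokens, the estimate cannot be anchored on a generated answer, which is the mechanism the abstract credits for the reduced overconfidence.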